Site-Independent Template-Block Detection
Abstract
Detecting template and noise blocks in web pages is an important step in improving the performance of information retrieval and content extraction. Most of the many approaches proposed either assume that they operate within a single website or require expensive hand-labeling of relevant and non-relevant blocks for model induction. This limits their applicability, since in many practical scenarios template blocks must be detected in arbitrary web pages with no prior knowledge of the site structure. In this work we propose to bridge these two approaches by using within-site template discovery techniques to drive the induction of a site-independent template detector. Our approach eliminates the need for human annotation and produces highly effective models. Experimental results demonstrate the usefulness of the proposed methodology for the important application of keyword extraction, with relative performance gains as high as 20%.
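The general idea of letting within-site discovery supply training labels for a site-independent detector can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: blocks are plain strings, a block counts as template when it repeats on at least 80% of a site's pages, the two features (log token count and whether the block ends like a sentence) are hypothetical, and a nearest-centroid rule stands in for whatever model the authors actually induce.

```python
import math
from collections import Counter


def label_site(pages, min_fraction=0.8):
    """Within-site step: label a block as template if it recurs on at least
    min_fraction of the pages of ONE site (threshold is an assumption)."""
    counts = Counter()
    for page in pages:
        counts.update(set(page))  # count each block once per page
    return [(block, counts[block] / len(pages) >= min_fraction)
            for page in pages for block in page]


def features(block):
    """Hypothetical site-independent features: log token count and a flag
    for sentence-final punctuation."""
    tokens = block.split()
    ends_sentence = 1.0 if block.rstrip().endswith((".", "!", "?")) else 0.0
    return (math.log1p(len(tokens)), ends_sentence)


def train(sites):
    """Pool the automatically labeled blocks of all sites and compute one
    feature centroid per class (True = template, False = content)."""
    sums = {True: [0.0, 0.0], False: [0.0, 0.0]}
    n = {True: 0, False: 0}
    for pages in sites:
        for block, is_tpl in label_site(pages):
            f = features(block)
            sums[is_tpl][0] += f[0]
            sums[is_tpl][1] += f[1]
            n[is_tpl] += 1
    # Skip a class that received no examples to avoid dividing by zero.
    return {label: (s[0] / n[label], s[1] / n[label])
            for label, s in sums.items() if n[label]}


def is_template(block, centroids):
    """Site-independent step: assign the class of the nearest centroid."""
    f = features(block)
    return min(centroids, key=lambda label: math.dist(f, centroids[label]))


# Toy data: two sites, two pages each; no human labels are supplied.
SITE_A = [
    ["Home | News | Contact",
     "The city council voted on Tuesday to approve the new budget for public transport.",
     "Copyright 2007 Site A"],
    ["Home | News | Contact",
     "Researchers reported a new method for measuring rainfall in mountain regions this year.",
     "Copyright 2007 Site A"],
]
SITE_B = [
    ["Login | Register",
     "This recipe combines roasted tomatoes with fresh basil and a little olive oil.",
     "All rights reserved"],
    ["Login | Register",
     "Slow cooking the onions first gives the soup a much deeper flavour overall.",
     "All rights reserved"],
]

model = train([SITE_A, SITE_B])
```

Calling `is_template(block, model)` on a block from a site that was not in the training data then needs no knowledge of that site's structure, which is the property the abstract emphasizes. `math.dist` requires Python 3.8 or later.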
Publication date: 2007